feat: dask namespace `concat` method (#840)
Conversation
Hey thanks for the PR, that was fast :)
I left a couple of comments, the main concern is about triggering computation on behalf of the user, which we would like to avoid as much as possible.
I would also double check the behaviour of the default `join="outer"` argument for `dd.concat`.
Finally, regarding the missing test, feel free to add it to get higher/better coverage of edge cases.
Cheers @FBruzzesi! Think you're right about the behaviour of "outer" by default. In theory, I think it only comes into play on "vertical" merges (otherwise, mismatching indexes throw errors), where columns are already checked to match, but that sounds a little like an error waiting to happen. I just looked at adding another test, but it looks like it's totally fine to horizontally concatenate different-length frames outside of dask. For now, I've taken out the check for matching lengths; it avoids any computation for figuring out length, and passes error handling over to dask, which is probably preferable? (Let me know if I'm missing a trick on how to check concats are likely to be valid, though.)
Also, not very related, but a quick question on tests! Looks like the ubuntu tests are failing at the moment because they require 100% coverage and there's something like 99.7% (macos requires 90% and windows 95%). Is it intentional that they're all different? If it'd be handy, I can bundle updating them to all match into this PR.
Hey, sorry, I am on mobile, so this could be a sloppy comment. I will try to do my best.

Alright, to me the `join` keyword resonates more with a horizontal concat, but apparently not. I am happy to keep the check on the column names.

Maybe we can just let dask run its behavior without manually checking dataframe lengths. This is what we are doing for pandas, yet I would try to see how Polars LazyFrame behaves (i.e. if it throws an error, and if it does, it most likely happens only at collect time).
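For reference (my own sketch, not from the thread), pandas applies the `join` keyword on the axis *other* than the concatenation axis, so on the default vertical concat it controls column alignment:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [1, 2]})
right = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]})

# join="inner" on a vertical concat keeps only the shared columns...
inner = pd.concat([right, left], join="inner")
print(list(inner.columns))  # ['a', 'b']

# ...while the default join="outer" keeps all columns, padding the
# frame that lacks "c" with NaN.
outer = pd.concat([right, left], join="outer")
print(list(outer.columns))  # ['a', 'b', 'c']
```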
Thanks @FBruzzesi, the comment is really helpful! Here are the changes I've made since my initial commit (I've squashed the commits together for neatness):
Thanks! I am still a bit unsure about the difference. I can run some tests later today, or most likely tomorrow, unless you have some examples ready to show the difference in behavior.

I can still see the check using

We recently changed all those (check #788), so I would suggest keeping it as before:

```diff
- dd = get_dask_dataframe()
+ import dask.dataframe as dd  # ignore-banned-import
```

Back to the first point
Well that's embarrassing! Looks like my "neat squashing" was just a "neat deleting a commit" for the validation check removal. All taken out now. OK, I've done some playing around, and actually it looks like dask can concatenate on axis 1 with different lengths anyway (I don't know if this works with expression-based columns though, I was definitely getting an assertion error previously). That's pretty handy, because it makes the outer/inner behaviour really clear for both! I'll explain them here, and hopefully you can give me a steer on the desired Narwhals behaviour? (I think I know, but I'm not 100% sure.)

```python
import dask.dataframe as dd

left = dd.from_dict({"a": [1, 2], "b": [1, 2]}, npartitions=1)
right = dd.from_dict({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]}, npartitions=1)

# inner on axis 0 (vertical)
dd.concat([right, left], axis=0, join="inner").compute()
#    a  b
# 0  1  1
# 1  2  2
# 2  3  3
# 0  1  1
# 1  2  2

# outer on axis 0 (vertical)
dd.concat([right, left], axis=0, join="outer").compute()
#    a  b    c
# 0  1  1  1.0
# 1  2  2  2.0
# 2  3  3  3.0
# 0  1  1  NaN
# 1  2  2  NaN

# inner on axis 1 (horizontal)
dd.concat([right, left], axis=1, join="inner").compute()
#    a  b  c  a  b
# 0  1  1  1  1  1
# 1  2  2  2  2  2

# outer on axis 1 (horizontal)
dd.concat([right, left], axis=1, join="outer").compute()
#    a  b  c    a    b
# 0  1  1  1  1.0  1.0
# 1  2  2  2  2.0  2.0
# 2  3  3  3  NaN  NaN
```

Hopefully that makes it a little clearer? I think I muddied the water a little by being confused before about the behaviour. Looking at the other examples in narwhals, I've implemented the above behaviour for now; I'd appreciate feedback on whether that's logical though!
Awesome, that makes it much clearer and saved me a bunch of time, thanks a lot. I am checking polars LazyFrame behavior, which is what we should aim to replicate.

Edit: TL;DR
Thanks @FBruzzesi, that's very helpful! I've made a couple more changes on the basis of that:

I've tried adding in some new tests, which I think would be ideal; here are the tests I added:

```python
# extra condition for vertical concat to check that column order is required
with pytest.raises((Exception, TypeError)):
    reversed_data_right = dict(reversed(data_right.items()))
    reversed_df_right = nw.from_native(constructor(reversed_data_right)).lazy()
    nw.concat([df_left, reversed_df_right], how="vertical").collect()
```

Only that actually doesn't throw an error for most of them. And then the same again with:

```python
# extra condition to check that duplicate columns are banned
with pytest.raises(Exception, match="occurence"):
    duplicate_columns = nw.from_native(constructor(data_right | data))
    nw.concat([df_left, duplicate_columns], how="horizontal")
```

Which also doesn't throw an error. I'm happy to try to address this for other dataframe types if Narwhals wants to be strict on those conditions, but I guess it'd probably be better as a separate issue/PR, so I've just left those tests out for now.

Edit: I think you found the reason I was seeing inconsistent behaviour with dask! Initially I was trying it on the latest version, but the examples I shared were from an installation of 2024.2, before expressions were opt-out.
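For what it's worth (my own sketch, not from the thread), pandas itself happily produces duplicate column names on a horizontal concat, which is consistent with most backends not raising in the test above:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"a": [3, 4]})

# axis=1 concatenation keeps both columns named "a" rather than raising,
# so any strictness about duplicates has to be enforced by the wrapper.
out = pd.concat([left, right], axis=1)
print(list(out.columns))  # ['a', 'a']
```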
Thanks for adjusting on all fronts and already planning some further tests.

I commented on some of the error raises since:
- in other dataframes we raise `AssertionError` for an empty list of items
- other errors are certainly not type related, so we can keep `AssertionError` for now

I think we can merge after that :)
Awesome, thanks @FBruzzesi! I've made those changes (as an FYI, I also raised an issue over on dask/dask to check the expected behaviour for concatenating on
Logic is 100% in place; there are a couple of issues with coverage for:
- the new checks we discussed
- the variables introduced for mypy type hints

I commented on how to deal with these cases. As soon as all CI is green we are ready to merge.
Ready to ship it? Thanks @benrutter

Nice one, thanks all!
I've added in a `concat` method returning a `DaskLazyFrame`. One thing that wasn't tested for in `concat_test` was whether an error gets thrown when trying to horizontally concat frames of different length. I'm not 100% sure what happens within polars/pandas when you do this, but dask will throw a confusing assertion error when you try to access anything within the dataframe, so I thought putting in a handy check here was worthwhile.
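For comparison (my own sketch, not part of the PR), pandas aligns on the index for a horizontal concat, so frames of different length succeed and the shorter one is padded with NaN rather than raising:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [1, 2, 3]})

# Outer index alignment: the result has 3 rows, and "a" is missing
# a value for the last row (and is upcast to float to hold the NaN).
out = pd.concat([left, right], axis=1)
print(out)
#      a  b
# 0  1.0  1
# 1  2.0  2
# 2  NaN  3
```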